Supplementary Materials Online Map Vectorization for Autonomous Driving: A Rasterization Perspective

Neural Information Processing Systems

The base model takes surround-view images of the ego-vehicle as input. Figure 1 provides further visual comparisons of HD map vectorization results, along with additional visualizations of MapVR's HD map construction results; these reaffirm the necessity of a rasterization perspective in map vectorization. As discussed in Section 3, the Chamfer-distance-based metric struggles to offer a fair evaluation for such scenarios.


Online Map Vectorization for Autonomous Driving: A Rasterization Perspective

Neural Information Processing Systems

MapVR (Map Vectorization via Rasterization) is a novel framework that applies differentiable rasterization to vectorized outputs and then performs precise, geometry-aware supervision on rasterized HD maps.
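The core idea is that a raster rendered from polyline vertices can be a smooth function of those vertices, so pixel-level losses against a ground-truth mask can flow gradients back to the vector geometry. The sketch below illustrates this with a soft distance-based renderer in NumPy; it is a minimal, hypothetical illustration of the principle, not MapVR's actual rasterizer.

```python
import numpy as np

def soft_rasterize(polyline, size=32, sigma=1.5):
    """Render a 2D polyline into a soft occupancy raster.

    Each pixel's value decays smoothly with its distance to the nearest
    polyline segment, so the raster is a differentiable function of the
    vertex coordinates -- the property that enables rasterization-based
    supervision. (Illustrative NumPy sketch only.)
    """
    ys, xs = np.mgrid[0:size, 0:size]
    pixels = np.stack([xs, ys], axis=-1).astype(float)  # (H, W, 2)
    dist = np.full((size, size), np.inf)
    for a, b in zip(polyline[:-1], polyline[1:]):
        ab = b - a
        # Project each pixel onto the segment [a, b], clamped to its ends.
        t = np.clip(((pixels - a) @ ab) / (ab @ ab), 0.0, 1.0)
        closest = a + t[..., None] * ab
        d = np.linalg.norm(pixels - closest, axis=-1)
        dist = np.minimum(dist, d)
    return np.exp(-(dist ** 2) / (2 * sigma ** 2))  # values in (0, 1]

# A lane-boundary-like polyline rendered to a 32x32 soft mask.
line = np.array([[4.0, 4.0], [28.0, 10.0], [28.0, 28.0]])
raster = soft_rasterize(line)
```

A pixel-wise loss (e.g. dice or binary cross-entropy) between such a raster and a ground-truth map mask then supervises the polyline geometry directly, rather than through point-to-point matching.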


Unveiling the Hidden: Online Vectorized HD Map Construction with Clip-Level Token Interaction and Propagation

Neural Information Processing Systems

Predicting and constructing road geometric information (e.g., lane lines, road markers) is a crucial task for safe autonomous driving, while such static map elements can be repeatedly occluded by various dynamic objects on the road. Recent studies have shown significantly improved vectorized high-definition (HD) map construction performance, but there has been insufficient investigation of temporal information across adjacent input frames (i.e., clips), which may lead to inconsistent and suboptimal prediction results. To tackle this, we introduce a novel paradigm of clip-level vectorized HD map construction, MapUnveiler, which explicitly unveils the occluded map elements within a clip input by relating dense image representations with efficient clip tokens.
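The "relating dense image representations with efficient clip tokens" pattern is, at its core, cross-attention: a small set of token queries aggregates evidence from a large flattened feature map. The following is a minimal single-head sketch of that interaction pattern; MapUnveiler's actual architecture is more elaborate.

```python
import numpy as np

def cross_attention(tokens, features):
    """Single-head cross-attention: a few clip tokens query a dense
    feature map, pooling evidence from all spatial positions.

    tokens:   (T, D) clip tokens acting as queries
    features: (N, D) flattened dense image features (keys/values)
    Returns updated (T, D) tokens. Illustrative sketch only.
    """
    d = tokens.shape[-1]
    scores = tokens @ features.T / np.sqrt(d)        # (T, N)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ features                        # (T, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))      # e.g. 4 compact clip tokens
features = rng.standard_normal((64, 16))   # e.g. an 8x8 feature map, flattened
updated = cross_attention(tokens, features)
```

Because the token set is small and fixed-size, propagating it across frames of a clip is far cheaper than propagating the dense features themselves, which is what makes clip-level temporal reasoning efficient.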


Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Konstantinidis, Fabian, Sackmann, Moritz, Hofmann, Ulrich, Stiller, Christoph

arXiv.org Artificial Intelligence

Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
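The viewpoint invariance claimed above comes from encoding each pair of instances by their relative pose: the pose of one local frame expressed in the other. A minimal sketch of that relative positional encoding is below, with the invariance checked by applying the same rigid transform to both poses; the paper's encoding may include additional terms.

```python
import numpy as np

def relative_encoding(pose_i, pose_j):
    """Pose of instance j expressed in instance i's local frame.

    pose = (x, y, heading) in a shared global frame. The result
    (dx, dy, dtheta) is invariant to rigid transforms of the global
    frame, which is what makes an instance-centric, query-centric
    encoder viewpoint-invariant. Minimal illustrative sketch.
    """
    xi, yi, hi = pose_i
    xj, yj, hj = pose_j
    c, s = np.cos(-hi), np.sin(-hi)
    dx = c * (xj - xi) - s * (yj - yi)   # rotate offset into frame i
    dy = s * (xj - xi) + c * (yj - yi)
    dtheta = (hj - hi + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
    return np.array([dx, dy, dtheta])

a = (1.0, 2.0, 0.3)   # hypothetical agent pose
b = (4.0, 6.0, 1.0)   # hypothetical map-element pose
rel = relative_encoding(a, b)
```

Because these pairwise encodings do not change when the scene is translated or rotated, static map tokens encoded once in their own frames can be reused across simulation steps without re-encoding.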


FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

Pyo, Jiyoon, Jiao, Yuankun, Jung, Dongwon, Li, Zekun, Jang, Leeje, Kirsanova, Sofia, Kim, Jina, Lin, Yijun, Liu, Qin, Xie, Junyi, Askari, Hadi, Xu, Nan, Chen, Muhao, Chiang, Yao-Yi

arXiv.org Artificial Intelligence

Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although it is essential both as a core cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision-language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in the Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.



SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction

Immel, Fabian, Pauls, Jan-Hendrik, Fehler, Richard, Bieder, Frank, Merkert, Jonas, Stiller, Christoph

arXiv.org Artificial Intelligence

Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as priors, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far-range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet
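The "geometry plus free-text annotation" idea can be illustrated by building a token for one SD map element from simple geometric statistics concatenated with a text embedding of its tag string. The sketch below uses a hashed character-trigram embedding purely for self-containment; SDTagNet itself uses learned NLP embeddings, and the function and tag strings here are hypothetical.

```python
import numpy as np

def embed_sd_element(points, tag_text, text_dim=8):
    """Feature token for one SD map element: geometric statistics
    concatenated with a bag-of-character-trigrams text embedding of
    its OpenStreetMap-style tag string.

    Hypothetical sketch of the 'geometry + free-text annotation'
    idea; not SDTagNet's actual encoder.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    extent = pts.max(axis=0) - pts.min(axis=0)
    length = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()
    geo = np.concatenate([centroid, extent, [length]])  # (5,)

    # Hashed character trigrams: any tag string maps to a fixed-size
    # vector, so no predefined class taxonomy is required.
    text = np.zeros(text_dim)
    for i in range(len(tag_text) - 2):
        text[hash(tag_text[i:i + 3]) % text_dim] += 1.0
    norm = np.linalg.norm(text)
    if norm > 0:
        text /= norm
    return np.concatenate([geo, text])

token = embed_sd_element([[0, 0], [10, 0], [10, 5]],
                         "highway=residential; lanes=2")
```

The payoff of embedding raw tag text rather than a hand-picked class set is that unusual or newly introduced tags still produce informative features instead of falling into an "other" bucket.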


